Assignment 2

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Sunday, 4th February 2024 at 11:59 pm.

  5. Five points are allocated for properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering with Quarto in the comments section of Canvas, and submit the .ipynb file.
    • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  6. The maximum possible score in the assignment is 105 + 5 (proper formatting) = 110 out of 100.

1) Multiple Linear Regression (24 points)

A study was conducted on 97 male patients with prostate cancer who were due to receive a radical prostatectomy (complete removal of the prostate). The prostate.csv file contains data on 9 measurements taken from these 97 patients. Each row (observation) represents a patient and each column (variable) represents a measurement. The description of variables can be found here: https://rafalab.github.io/pages/649/prostate.html

1a)

Fit a linear regression model with lpsa as the response and all the other variables as the predictors. Print its summary. (2 points) Write down the optimal equation that predicts lpsa using the predictors. (2 points)
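
A minimal sketch of one way to set this up with statsmodels' formula interface, assuming prostate.csv is in the working directory and the column names match the linked description:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Read the data; assumes prostate.csv sits in the working directory
prostate = pd.read_csv('prostate.csv')

# Regress lpsa on all remaining variables and print the summary
mlr = smf.ols('lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45',
              data=prostate).fit()
print(mlr.summary())
```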

1b)

Is the overall regression statistically significant? In other words, is there a statistically significant relationship between the response and at least one predictor? You need to justify your answer for credit. (2 points)

1c)

What does the optimal coefficient of svi tell us as a numeric output? Make sure you include the predictor (svi), the response (lpsa), and the other predictors in your answer. (2 points)

1d)

Check the \(p\)-values of gleason and age. Are these predictors statistically significant? You need to justify your answer for credit. (2 points)

1e)

Check the 95% Confidence Interval of age. How can you relate it to its p-value and statistical significance, which you found in the previous part? (2 points)
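
One way to pull this interval from the fitted model, assuming mlr is the multiple regression fit from part 1a:

```python
# 95% confidence intervals for all coefficients; the row labeled 'age' is the one of interest
print(mlr.conf_int(alpha=0.05).loc['age'])
```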

1f)

This question requires some thinking and brings your 303-1 and 303-2 knowledge together.

Fit a simple linear regression model of lpsa against gleason and check the \(p\)-value of gleason using the summary. (2 points) Did the statistical significance of gleason change in the absence of other predictors? (1 point) Why or why not? (3 points)

Hints:

  1. You need to compare this model with the Multiple Linear Regression model you created above.
  2. Printing a correlation matrix of all the predictors should be useful.
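
A sketch of the two hinted steps, assuming the prostate data frame from part 1a is still in memory:

```python
import statsmodels.formula.api as smf

# Simple linear regression of lpsa on gleason alone
slr = smf.ols('lpsa ~ gleason', data=prostate).fit()
print(slr.summary())

# Correlation matrix of the predictors, to see which of them move together with gleason
print(prostate.drop(columns='lpsa').corr())
```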

1g)

Predict the lpsa of a 65-year-old man with lcavol = 1.35, lweight = 3.65, lbph = 0.1, svi = 0.22, lcp = -0.18, gleason = 6.75, and pgg45 = 25. Find the 95% confidence and prediction intervals as well. (2 points)
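
A sketch using get_prediction(), assuming mlr is the full model from part 1a; summary_frame() reports both intervals:

```python
import pandas as pd

# New observation; column names must match the predictors used to fit the model
new_obs = pd.DataFrame({'lcavol': [1.35], 'lweight': [3.65], 'age': [65],
                        'lbph': [0.1], 'svi': [0.22], 'lcp': [-0.18],
                        'gleason': [6.75], 'pgg45': [25]})

pred = mlr.get_prediction(new_obs)
# mean_ci_* columns are the confidence interval, obs_ci_* the prediction interval
print(pred.summary_frame(alpha=0.05))
```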

1h)

In the Multiple Linear Regression model with all the predictors, you should see a total of five predictors that appear to be statistically insignificant. Why is it not a good idea to directly conclude that all of them are statistically insignificant? (2 points) Implement the additional test needed to conclude that all five predictors are jointly statistically insignificant. (2 points)

Hint: f_test() method
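
A sketch of the joint test, assuming mlr is the full model from part 1a; the predictor names below are placeholders for whichever five look insignificant in your summary:

```python
# Joint F-test that all five coefficients are simultaneously zero
hypothesis = 'age = 0, lbph = 0, lcp = 0, gleason = 0, pgg45 = 0'  # replace with your five predictors
print(mlr.f_test(hypothesis))
```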

2) Multiple Linear Regression with Variable Transformations (22 points)

The infmort.csv file has the infant mortality data of different countries in the world. The mortality column represents the infant mortality rate with “deaths per 1000 infants” as the unit. The income column represents the per capita income in USD. The other columns should be self-explanatory. (This is an old dataset, as can be seen from some country names.)

2a)

Start your analysis by creating (i) a boxplot of log(mortality) for each region and (ii) a boxplot of income for each region. Note that the region column has the continent names. (3 points)

Note: You need to use np.log, which is the natural log. This is to better distinguish the mortality values.
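
A sketch of the two boxplots with seaborn, assuming infmort.csv is in the working directory and the columns are named mortality, income, and region:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

infmort = pd.read_csv('infmort.csv')
infmort['log_mortality'] = np.log(infmort['mortality'])

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(data=infmort, x='region', y='log_mortality', ax=axes[0])
sns.boxplot(data=infmort, x='region', y='income', ax=axes[1])
plt.tight_layout()
plt.show()
```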

2b)

In the previous part, you should see that Europe has the lowest infant mortality rate on average, but it also has the highest per capita income on average. Our goal is to see if Europe still has the lowest mortality rate once we remove the effect of income. We will work toward an answer over the rest of this question.

Create four scatter plots: (i) mortality against income, (ii) log(mortality) against income, (iii) mortality against log(income), and (iv) log(mortality) against log(income). (3 points) Based on the plots, create an appropriate model to predict the mortality rate as a function of per capita income. Print the model summary. (2 points)
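
A sketch of the four scatterplots and one candidate model, assuming infmort is the data frame from part 2a; the log-log model below is only appropriate if that is the panel that looks most linear:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

pairs = [('income', 'mortality'), ('income', 'log(mortality)'),
         ('log(income)', 'mortality'), ('log(income)', 'log(mortality)')]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (xlab, ylab) in zip(axes.ravel(), pairs):
    x = np.log(infmort['income']) if 'log' in xlab else infmort['income']
    y = np.log(infmort['mortality']) if 'log' in ylab else infmort['mortality']
    ax.scatter(x, y)
    ax.set(xlabel=xlab, ylabel=ylab)
plt.tight_layout()
plt.show()

# If the log-log panel is the most linear, a log-log model is one reasonable choice
model_income = smf.ols('np.log(mortality) ~ np.log(income)', data=infmort).fit()
print(model_income.summary())
```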

2c)

Update the model you created in the previous part by adding region as a predictor. Print the model summary. (2 points)

2d)

Use the model developed in the previous part to compute a new adjusted_mortality variable for each observation in the data. (5 points) The adjusted mortality rate is the mortality rate after removing the estimated effect of income. You need to calculate it with the following steps (sketched in code below):

  • Multiply the (transformed) income column with its optimal coefficient. This is the estimated effect of income.
  • Subtract the product from the (transformed) response column. This removes the estimated effect of income.
  • You may need to do an inverse transformation to calculate the actual adjusted mortality rate values.

Make a boxplot of log(adjusted_mortality) for each region. (2 points)
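
A sketch of the adjustment, assuming infmort from part 2a and a log-log model with region from part 2c; the coefficient name 'np.log(income)' comes from the formula used there:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Part-2c model, assuming the log-log form was chosen in part 2b
model_region = smf.ols('np.log(mortality) ~ np.log(income) + region', data=infmort).fit()

# Step 1: estimated effect of (transformed) income on the transformed response
income_effect = model_region.params['np.log(income)'] * np.log(infmort['income'])

# Step 2: remove the effect on the log scale; Step 3: invert the log transformation
infmort['adjusted_mortality'] = np.exp(np.log(infmort['mortality']) - income_effect)

# Boxplot of log(adjusted_mortality) for each region
infmort['log_adj_mortality'] = np.log(infmort['adjusted_mortality'])
sns.boxplot(data=infmort, x='region', y='log_adj_mortality')
plt.show()
```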

2e)

Using the plots in parts a and d, answer the following questions:

  1. Does Europe still have the lowest mortality rate on average after removing the effect of income?

  2. How did the distribution of values among different continents change after removing the effect of income? How did the comparison of different continents change? Does any non-European country have a lower mortality rate than all the European countries after removing the effect of income?

(5 points)

3) Variable Transformations and Interactions (38 points)

The soc_ind.csv dataset contains many social indicators of a number of countries. Each row is a country and each column is a social indicator. The column names should be clear on what the variables represent. The GDP per capita will be the response variable throughout this question.

3a)

Using correlations, find out the most useful predictor for a simple linear regression model with gdpPerCapita as the response. You can ignore categorical variables for now. Let that predictor be \(P\). (2 points)
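
A sketch of one way to rank the candidates, assuming soc_ind.csv is in the working directory and the response column is spelled gdpPerCapita as stated above:

```python
import pandas as pd

soc_ind = pd.read_csv('soc_ind.csv')

# Correlation of every numeric column with gdpPerCapita, ranked by absolute value
corrs = soc_ind.corr(numeric_only=True)['gdpPerCapita'].drop('gdpPerCapita')
print(corrs.abs().sort_values(ascending=False))
```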

3b)

Create a scatterplot of gdpPerCapita vs \(P\). Does the relationship between gdpPerCapita and \(P\) seem linear or non-linear? (2 points)

3c)

If the relationship in the previous part is non-linear, create three models:

  • Only with \(P\)
  • With \(P\) and its quadratic term
  • With \(P\), its quadratic term and its cubic term

(2x3 = 6 points)

Compare the \(R\)-squared values of the models. (2 points)
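
A sketch of the three fits, with P standing in for whichever column was chosen in part 3a and soc_ind loaded as in that part:

```python
import statsmodels.formula.api as smf

# 'P' is a placeholder for the predictor chosen in part 3a
m1 = smf.ols('gdpPerCapita ~ P', data=soc_ind).fit()
m2 = smf.ols('gdpPerCapita ~ P + I(P**2)', data=soc_ind).fit()
m3 = smf.ols('gdpPerCapita ~ P + I(P**2) + I(P**3)', data=soc_ind).fit()

for label, m in [('linear', m1), ('quadratic', m2), ('cubic', m3)]:
    print(label, round(m.rsquared, 3))
```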

3d)

On the same figure:

  • create the scatterplot in part b.
  • plot the linear regression line (only using \(P\))
  • plot the polynomial regression curve that includes the quadratic and cubic terms.
  • add a legend to distinguish the linear and cubic fits.

(6 points)
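
A sketch of the combined figure, assuming m1 and m3 are the linear and cubic fits from part 3c and P is the placeholder predictor name:

```python
import matplotlib.pyplot as plt

# Sort by P so the fitted curves are drawn smoothly from left to right
grid = soc_ind.sort_values('P')

plt.scatter(soc_ind['P'], soc_ind['gdpPerCapita'], color='grey', label='data')
plt.plot(grid['P'], m1.predict(grid), label='linear fit')
plt.plot(grid['P'], m3.predict(grid), label='cubic fit')
plt.xlabel('P')
plt.ylabel('gdpPerCapita')
plt.legend()
plt.show()
```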

3e)

Develop a model to predict gdpPerCapita using \(P\) and continent as predictors. (No higher-order terms.)

  1. Which continent serves as the baseline? (2 points) Write down its equation. (2 points)

  2. For a given value of \(P\), are there any continents that do not have a statistically significant difference of predicted gdpPerCapita from the baseline continent? If yes, then which ones, and why? If no, then why not? You need to justify your answers for credit. (4 points)

3f)

The model developed in the previous part has a limitation. It assumes that the increase in predicted gdpPerCapita with a unit increase in \(P\) does not depend on the continent.

Eliminate this limitation by including the interaction of continent with \(P\) in the model. Print the summary of the model with interactions. (2 points) Which continent has the closest increase in predicted gdpPerCapita to the baseline continent with a unit increase in \(P\)? Which continent has the furthest? You need to justify your answers for credit. (5 points)
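
A sketch of the interaction model, assuming the categorical column is named continent and P is the placeholder predictor from part 3a:

```python
import statsmodels.formula.api as smf

# P * C(continent) expands to P + C(continent) + P:C(continent)
m_int = smf.ols('gdpPerCapita ~ P * C(continent)', data=soc_ind).fit()
print(m_int.summary())
```

The rows of the form P:C(continent)[T.<name>] in the summary are the continent-specific adjustments to the slope of \(P\).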

3g)

Using the model developed in the previous part, plot the regression lines of all the continents on the same figure. Put gdpPerCapita on the y-axis and \(P\) on the x-axis. (4 points) Use a legend to color-code the continents. (1 point)
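
A sketch of the per-continent lines, assuming m_int is the interaction model from part 3f and P is the placeholder predictor name:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# One fitted line per continent, evaluated on a grid of P values
p_grid = np.linspace(soc_ind['P'].min(), soc_ind['P'].max(), 100)
for cont in soc_ind['continent'].unique():
    line = pd.DataFrame({'P': p_grid, 'continent': cont})
    plt.plot(p_grid, m_int.predict(line), label=cont)
plt.xlabel('P')
plt.ylabel('gdpPerCapita')
plt.legend()
plt.show()
```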

4) Prediction with Sklearn (21 points)

Using the soc_ind.csv dataset and only sklearn and pandas, train a Linear Regression model. You will need the following steps (a code sketch is given at the end of this question):

  • gdpPerCapita is the response. (2 points)
  • Index, geographic_location and country columns are not necessary. (2 points)
  • All the remaining columns are predictors. (2 points)
  • continent column needs to be one-hot-encoded. (2 points)
  • Since the numeric values have different orders of magnitude, you need to scale the dataset. You can use StandardScaler from sklearn.preprocessing for this. Create an object (just like a model) and use .fit_transform with the data as the input. (4 points)
  • Train a LinearRegression model. Use the entire dataset as the training data. (3 points)
  • Get the predictions for the training data. (3 points)
  • Calculate the RMSE and MAE. (3 points)

For this question, you only need to calculate the training performance. In the future, we will see how to split a dataset into training and test sets.
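
A sketch of the full workflow with only pandas and sklearn, assuming the column names quoted above (Index, geographic_location, country, continent, gdpPerCapita) match the file:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

soc_ind = pd.read_csv('soc_ind.csv')

# Response and predictors; drop the columns that are not needed
y = soc_ind['gdpPerCapita']
X = soc_ind.drop(columns=['gdpPerCapita', 'Index', 'geographic_location', 'country'])

# One-hot encode the continent column
X = pd.get_dummies(X, columns=['continent'], dtype=int)

# Scale the predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train on the entire dataset and predict the training data
lr = LinearRegression()
lr.fit(X_scaled, y)
y_pred = lr.predict(X_scaled)

# Training RMSE and MAE
rmse = mean_squared_error(y, y_pred) ** 0.5
mae = mean_absolute_error(y, y_pred)
print('RMSE:', rmse, '\nMAE:', mae)
```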